Skip to content

[Variant] Fix the variant shred logic#10157

Open
klion26 wants to merge 3 commits into
apache:mainfrom
klion26:fix_variant_shred
Open

[Variant] Fix the variant shred logic#10157
klion26 wants to merge 3 commits into
apache:mainfrom
klion26:fix_variant_shred

Conversation

@klion26

@klion26 klion26 commented Jun 18, 2026

Copy link
Copy Markdown
Member

Which issue does this PR close?

What changes are included in this PR?

  • Seperate the logic for variant_shred and variant_get by introduce shred flag for row builder
  • Move some logic in Variant::as_xxx to type_conversion, and these moved code will be used when variant_get, variant_shred will use Variant::as_xxx
  • After the change, when shredding a variant, all Variant::as_xx() is identity function now, Variant::Int8 can only be treated as int8, but not int16/int32/int64, etc.
  • Removed the Variant::as_f16
  • Add a test to cover that shred can/can't be shredded to some datatype in test_variant_type_shredded_correctly

Are these changes tested?

Yes, added some tests to cover the logic

Are there any user-facing changes?

Yes, some Variant::as_xx logic have been changed.

@github-actions github-actions Bot added the parquet-variant parquet-variant* crates label Jun 18, 2026
@klion26 klion26 force-pushed the fix_variant_shred branch from 32a3f09 to aca457f Compare June 18, 2026 06:39
Previously, variant_shred and variant_get reuse the same code path,
causing type mismatches (e.g., bool was incorrectly shredded as int).
This commit separates the shred path and ensures each type is converted correctly.
@klion26 klion26 force-pushed the fix_variant_shred branch from aca457f to 003214c Compare June 18, 2026 06:55

@klion26 klion26 left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@scovich @sdf-jkl Please help review this when you're free, thanks.

impl_primitive_from_variant!(datatypes::Float32Type, as_f32);
impl_primitive_from_variant!(datatypes::Float64Type, as_f64);
impl_primitive_from_variant!(datatypes::Date32Type, as_naive_date, |v| {
enum NumericKind {

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These code are mainly moved from Variant.rs, the change here was adding the parameter shred for from_variant and variant_to_unscaled_decimal.

@sdf-jkl sdf-jkl left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @klion26 I took a look.

Overall I like the approach but I think we can improve it.

To follow more closely the existing split between arrow-array and arrow-cast we can only keep identity accessors for just the specific type without widening/narrowing. (i.e. Variant::as_i32 only if the Variant is i32) (it'd also address the lossy cast comment below)

And as for switching between shred_variant and variant_get cast semantics, we could pass them through variant_cast_with_options and add an argument for cast semantics we use { shred, get }

/// [specification]: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
/// [Variant Shredding specification]: https://github.com/apache/parquet-format/blob/master/VariantShredding.md
///
/// # Casting Semantics

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally thought it's a wrong idea to decouple cast from parquet-variant crate, but now I think it's very similar to how arrow-array and arrow-cast are decoupled.

cast_options,
capacity,
NullValue::ArrayElement,
false,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
false,
true,

if we keep false here it preserves the old too lenient cast semantics.

The branch explicitly checks if shredded is true.

Can check with (doesn't pass):

    // `VariantToListArrowRowBuilder::try_new` hardcodes `shred = false` for its element
    // builder (see variant_to_arrow.rs). The struct-field path threads `shred` correctly, so this
    // asymmetry is invisible without a nested-list test.
    #[test]
    fn test_array_shredding_does_not_cast_bool_element_to_int() {
        let input = build_variant_array(vec![VariantRow::List(vec![
            VariantValue::from(1i64),
            VariantValue::from(true),
            VariantValue::from(2i64),
        ])]);
        let list_schema = DataType::List(Arc::new(Field::new("item", DataType::Int64, true)));
        let result = shred_variant(&input, &list_schema).unwrap();
        assert_eq!(result.len(), 1);

        assert_list_structure_and_elements::<Int64Type, i32>(
            &result,
            1,
            &[0, 3],
            &[Some(3)],
            &[None],
            (
                // element 1 (the bool) must NOT be shredded into typed_value ...
                &[Some(1), None, Some(2)],
                // ... it must be preserved verbatim in the residual `value` instead.
                &[None, Some(Variant::BooleanTrue), None],
            ),
        );
    }

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As, thanks for pointing this out, this was my bad, I used false as a placeholder, but did not change me all.

cast_options,
capacity,
NullValue::ArrayElement,
false,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
false,
true,

same as above

Comment thread parquet-variant/src/variant.rs Outdated
self.as_num()
match *self {
Variant::Float(i) => Some(i),
Variant::Double(i) => Some(i as f32),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should shred do lossy narrowing f64 to f32?

@klion26 klion26 left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sdf-jkl Thanks for the review, I've addressed the comments, please take another look when you're free.

For the identity get for Variant::as_xx, I have thought that, but keep it as now, because the underlying physical data is the same(don't do lossless conversion), but I'm open for this question.

For the switching between shred_variant and variant_get, I left the switch logic to the definition of the builder -- here we know how to do the conversion better, variant_cast_with_options seems like a wrapper of them, so I keep as it for now.

@klion26 klion26 force-pushed the fix_variant_shred branch from 5032777 to 55c4a7d Compare June 24, 2026 09:47
@klion26

klion26 commented Jun 25, 2026

Copy link
Copy Markdown
Member Author

After another look of the code, I perfer the identity function for Variant::as_xx now, as some double can't convert to float even located in [f32::MIN, f32::MAX], this makes the semantics of Variant::as_f32 tricky(users may not know which double(f64) can or can't be convert to f32 easily), -- for shred, this may led to some double (in the same array) shred to f32 into typed_value column, and some double(which can't convert into f32 lossless) located into value column.

After changing to the identity function for Variant::as_xxx, there is a problem for Variant::as_u8/16/32/64, how do we want to handle this?
if we align the semantics for all of the Variant::as_xxx, maybe we can only return Some(value) for Variant::as_u8 when it is Variant::Int8 and None for the other variants, etc

fn as_u8(&self) -> Option<u8> {
   match *self {
     Variant::Int8(i) => i.try_into().ok(),
     _ => None,
   }
}

...

fn as_u64(&self) -> Option<u64> {
    match *self {
       Variant::Int64(i) => i.try_into().ok(),
      _ => None,
}

@sdf-jkl

sdf-jkl commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

If we go identity functions for Variant::as_xx we can drop Variant::as_u8/16/32/64 from parquet-variant entirely and keep them only as cast semantics in parquet-variant-compute?

We don't shred to unsigned types and the only path to use it would be cast in variant_get, so parquet-variant can safely drop them.

@klion26 klion26 force-pushed the fix_variant_shred branch from a2c254c to 03060f9 Compare June 25, 2026 11:36
@klion26 klion26 force-pushed the fix_variant_shred branch from 03060f9 to d370f9b Compare June 25, 2026 11:45
@klion26

klion26 commented Jun 25, 2026

Copy link
Copy Markdown
Member Author

I've update the code,

  • changed the Variant::as_xx as identity function
  • removed Variant::as_f16, -- will always return None for as_f16 if we keep it.

but don't remove the Variant::as_u8/16/32/64 for now, we use Variant::as_xx in the shred logic, but Variant::as_xx is not for shred use only, so I don't remove them for now, Variant::as_u8 returns Some(v) for Variant::Int8 but None for other variants. I can remove them if reached to a consensuse that we want to remove Variant::as_u8/16/32/64

}
}
);
impl_primitive_from_variant!(

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Align the cast logic with arrow-cast for Time related variant will in a followup pr

@sdf-jkl sdf-jkl left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @klion26.

I'm still not on board with keeping as_u* APIs because they are not identity cast (since there's no unsigned Variant and we can't shred to unsigned) but let's leave it to @alamb / @scovich.

Some doc fixes and a behavioral bug for decimal handling.

/// use parquet_variant::{Variant, VariantDecimal4};
/// use parquet_variant::Variant;
///
/// // you can read an int64 variant into an u8

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// // you can read an int64 variant into an u8
/// // you can read an int8 variant into an u8

/// use parquet_variant::{Variant, VariantDecimal4};
/// use parquet_variant::Variant;
///
/// // you can read an int64 variant into an u16

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above:

Suggested change
/// // you can read an int64 variant into an u16
/// // you can read an int16 variant into an u16

@@ -1111,40 +911,28 @@ impl<'m, 'v> Variant<'m, 'v> {
/// use parquet_variant::{Variant, VariantDecimal8};
///
/// // you can read an int64 variant into an u32

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// // you can read an int64 variant into an u32
/// // you can read an int32 variant into an u32

Comment on lines +847 to 848
/// Returns `Some(u8)` for int8 variants that fit in `u8`,
/// `None` for other variants or values that would overflow.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Returns `Some(u8)` for int8 variants that fit in `u8`,
/// `None` for other variants or values that would overflow.
/// Returns `Some(u8)` for a non-negative int8 variant, `None` otherwise.

Same case for as_u* below

///
/// This function is unlike `variant_to_unscaled_decim`, it would never rescale the decimal value,
/// and only return the unscaled integer representation for the specific decimal variants.
pub(crate) fn shred_variant_to_unscaled_decimal<O>(variant: &Variant<'_, '_>) -> Option<O::Native>

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function should check that Variant's scale matches the column scale. Otherwise Decimal(123, scale 3) shredded into a Decimal(.., scale 5) column would become .00123 instead of .123.

The question to rescale or not is also debatable.

I digged into spark and currently they rescale decimals and even cast to integers and vice versa:

 // If true, we will shred decimals to a different scale or to integers, as long as they are
 // numerically equivalent. Similarly, integers will be allowed to shred to decimals.

More resources:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet-variant parquet-variant* crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fix the shred logic in parquet-variant

2 participants